2025-08-16
I don't like the hype around LLMs. I do think they have some good use cases, and I don't want to make a blanket statement that all LLM usage is bad. But the way large corporations are currently using them is very harmful to the world: they use all of our intellectual property for their monetary gain without giving anything back to the original authors, all while wasting huge amounts of resources and making the internet a worse place.
So I will strive to prevent OpenAI, MS, ByteDance, Meta, etc. from making money off this blog. Maybe that won't work, but we can give it a try and have some fun in the process.
In this Hacker News thread about an LLM bot trap called Nepenthes, someone mentioned they wrote a tool called Quixotic, which garbles the contents of a static site using Markov chains. This seemed ideal for this site: whenever I post a new article, I can simply regenerate the garbled version of the site and serve that to the LLM scraper bots.
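I won't pretend to know how Quixotic works internally, but the general idea of Markov-chain garbling is easy to illustrate. Here is a toy Python sketch of the concept, not Quixotic's actual code: learn which word tends to follow which, then swap a fraction of the words for a plausible-but-wrong successor.

import random
import re

# Toy Markov-chain garbler, purely for illustration (not Quixotic's code).
def garble(text, percent=0.4):
    words = re.findall(r'\S+', text)
    # Record which words have been seen following each word.
    followers = {}
    for a, b in zip(words, words[1:]):
        followers.setdefault(a, []).append(b)
    out = []
    for prev, word in zip([None] + words, words):
        # With the given probability, replace the word with a random
        # successor of the previous word instead of the real one.
        if prev in followers and random.random() < percent:
            out.append(random.choice(followers[prev]))
        else:
            out.append(word)
    return ' '.join(out)

print(garble("the quick brown fox jumps over the lazy dog " * 5))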
On the (Debian) machine I edit the website on, I built Quixotic like so:
- apt install cargo
- apt install git
- git clone https://github.com/marcus0x62/quixotic
- cd quixotic
- cargo build --release
I didn't install it with "cargo install" as described on the Quixotic website, since I noticed I could just run the executable from the ./quixotic/target/release/ directory, which is good enough.
Note: I don't know the slightest thing about Rust or Cargo. The way I did things might not be the best.
Then I just let it go to work on my site like this:
./quixotic --input /home/user/Documents/BlinkyCursor/ --output /home/user/Documents/BlinkyQuix/ --percent 0.40
Quixotic then wrote a garbled version of my site to /home/user/Documents/BlinkyQuix/. You can see the result here; I think it's quite funny.
Note: The first time I ran Quixotic, it did nothing; it just reproduced the original site in the output directory. I spent some time figuring out why, eventually taking a look at the source code. Although I don't know Rust, I know enough programming to notice it only processes .html files, not .htm files. After renaming the site files to the correct extension, everything worked fine.
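The renaming itself is trivial; a small Python snippet like this would do it (the path is the same example directory as above, adjust to taste):

from pathlib import Path

# Give every .htm file in the site directory an .html extension
# so Quixotic will process it.
site_dir = Path('/home/user/Documents/BlinkyCursor')
for htm_file in site_dir.rglob('*.htm'):
    htm_file.rename(htm_file.with_suffix('.html'))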
Of course normal visitors need to see the normal site, and bots need to see the garbled version. The Quixotic website has a nice explanation of how to do that for the Apache webserver, which is what this site runs on.
I used the Apache rewrite module to redirect requests from known bot user agents to a directory with the garbled version of the website. This is the Apache virtual host config that works for me:
<VirtualHost *:443>
    ServerName blinkycursor.net
    <If "%{HTTP_USER_AGENT} in { 'AI2Bot', 'Ai2Bot-Dolma', 'aiHitBot', 'Amazonbot', 'Andibot', 'anthropic-ai', 'Applebot', 'Applebot-Extended', 'Awario', 'bedrockbot', 'Brightbot 1.0', 'Bytespider', 'CCBot', 'ChatGPT-User', 'Claude-SearchBot', 'Claude-User', 'Claude-Web', 'ClaudeBot', 'cohere-ai', 'cohere-training-data-crawler', 'Cotoyogi', 'Crawlspace', 'Datenbank Crawler', 'Devin', 'Diffbot', 'DuckAssistBot', 'Echobot Bot', 'EchoboxBot', 'FacebookBot', 'facebookexternalhit', 'Factset_spyderbot', 'FirecrawlAgent', 'FriendlyCrawler', 'Gemini-Deep-Research', 'Google-CloudVertexBot', 'Google-Extended', 'GoogleAgent-Mariner', 'GoogleOther', 'GoogleOther-Image', 'GoogleOther-Video', 'GPTBot', 'iaskspider/2.0', 'ICC-Crawler', 'ImagesiftBot', 'img2dataset', 'ISSCyberRiskCrawler', 'Kangaroo Bot', 'meta-externalagent', 'Meta-ExternalAgent', 'meta-externalfetcher', 'Meta-ExternalFetcher', 'MistralAI-User', 'MistralAI-User/1.0', 'MyCentralAIScraperBot', 'netEstate Imprint Crawler', 'NovaAct', 'OAI-SearchBot', 'omgili', 'omgilibot', 'Operator', 'PanguBot', 'Panscient', 'panscient.com', 'Perplexity-User', 'PerplexityBot', 'PetalBot', 'PhindBot', 'Poseidon Research Crawler', 'QualifiedBot', 'QuillBot', 'quillbot.com', 'SBIntuitionsBot', 'Scrapy', 'SemrushBot-OCOB', 'SemrushBot-SWA', 'Sidetrade indexer bot', 'Thinkbot', 'TikTokSpider', 'Timpibot', 'VelenPublicWebCrawler', 'WARDBot', 'Webzio-Extended', 'wpbot', 'YaK', 'YandexAdditional', 'YandexAdditionalBot', 'YouBot' }" >
        RewriteEngine on
        RewriteCond "/var/www/blinkycursor/quix/%{REQUEST_URI}" -f
        RewriteRule "^(.*)$" "/quix/%{REQUEST_URI}" [L]
    </If>
    DocumentRoot /var/www/blinkycursor
    ...
This is only slightly different from the example configuration on the Quixotic website. You can test whether it works by changing your browser's user agent to TikTokSpider or something similar and browsing to this site.
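If you'd rather test from a script than fiddle with your browser, something like this quick Python sketch should work too; it just fetches the front page while pretending to be one of the listed bots:

import requests

# Fetch the front page with a spoofed bot user agent; the response
# should be the garbled version of the site.
headers = {'User-Agent': 'TikTokSpider'}
resp = requests.get('https://blinkycursor.net/', headers=headers)
print(resp.text[:500])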
Now, while this works well, I don't want to manually update my Apache config every time someone builds a new LLM scraper bot. The Hacker News thread also mentioned that a nice list of bots is maintained in the ai.robots.txt GitHub repo, so I whipped up a small Python script that downloads the list and updates the Apache config automatically.
bot-updater.py
import requests
import os
import warnings

bots_url = 'https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/refs/heads/main/robots.json'
config_file = '/etc/apache2/sites-available/blinkycursor.net.conf'
config_file_keyword = '    <If "%{HTTP_USER_AGENT} in { '
file_changed = False

# Download the current bot list from the ai.robots.txt repo.
try:
    resp = requests.get(bots_url)
    json_response = resp.json()
except Exception as err:
    print("Could not get bot list from GitHub.", err)
    exit(1)

# The bot names are the keys of the JSON object; quote them for Apache.
bot_list = list(json_response.keys())
for i in range(len(bot_list)):
    bot_list[i] = '\'' + bot_list[i] + '\''
joined_bot_list = ', '.join(bot_list)

# Build the <If ...> line for the virtual host config.
config_string = '    <If "%{HTTP_USER_AGENT} in { ' + joined_bot_list + ' }" >\n'
print(config_string)

with open(config_file, 'r', encoding='utf-8') as file:
    file_lines = file.readlines()

# Replace the existing <If ...> line with the freshly generated one.
for j in range(len(file_lines)):
    if file_lines[j].startswith(config_file_keyword):
        print(file_lines[j])
        file_lines[j] = config_string
        file_changed = True
        break

if file_changed:
    with open(config_file, 'w', encoding='utf-8') as file:
        file.writelines(file_lines)
    # Reload Apache so the new list takes effect, but only if it's running.
    apache_status = os.system('systemctl is-active --quiet apache2.service')
    if apache_status == 0:
        os.system('systemctl reload --quiet apache2.service')
else:
    warnings.warn('No changes made in Apache config. Something might be wrong.')
The script downloads the bot list in JSON format, distills an Apache config line out of it, inserts that line into the Apache virtual host config, and reloads Apache. I scheduled it to run every night using a systemd service and timer.
bot-updater.service
[Unit]
Description=Updates the Apache configuration with a new list of bots to block
Wants=bot-updater.timer

[Service]
Type=oneshot
ExecStart=/usr/bin/python3 -u /root/scripts/bot-updater.py

[Install]
WantedBy=multi-user.target
bot-updater.timer
[Unit]
Description=Timer for: bot-updater.service
Requires=bot-updater.service

[Timer]
Unit=bot-updater.service
# Run every night at 2 am
OnCalendar=*-*-* 02:00:00

[Install]
WantedBy=timers.target
This was my first time using systemd timers. I usually just use good old cron jobs, but figured I'd better learn to do it the more modern way.
The .service and .timer files should be placed in /etc/systemd/system, after which you need to run "systemctl daemon-reload" to make systemd detect the changes. Once that's done, you can use "systemctl enable bot-updater.timer" to enable the timer and something like "journalctl -f -u bot-updater.service" to check what's going on with the updater service.
Some stuff that could be added to further improve this includes:
- Make sure the Apache config is still valid after any changes, using "apachectl configtest" (a rough sketch of this follows below)
- Email notifications when the script generates an error
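For the first item, the reload step in bot-updater.py could be guarded by a config test. A rough sketch, in the same os.system style as the rest of the script (not something I have running yet):

import os
import warnings

# Only reload Apache if the rewritten config passes a syntax check.
if os.system('apachectl configtest') == 0:
    if os.system('systemctl is-active --quiet apache2.service') == 0:
        os.system('systemctl reload --quiet apache2.service')
else:
    warnings.warn('apachectl configtest failed, not reloading Apache.')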